Abstract

Partial monitoring is a generic framework for sequential decision-making with incomplete
feedback. It encompasses a wide class of problems such as dueling bandits, learning with
expert advice, dynamic pricing, dark pools, and label efficient prediction. We study the
utility-based dueling bandit problem as an instance of the partial monitoring problem and
prove that it fits the time-regret partial monitoring hierarchy as an easy instance, i.e. one
with $\tilde{O}(\sqrt{T})$ regret. We survey some partial monitoring algorithms and see how
they could be used to solve dueling bandits efficiently.

Keywords: Online Learning, Dueling Bandits, Partial Monitoring, Partial Feedback,
Multi-armed Bandits


1. Introduction 


Partial Monitoring (PM) provides a generic mathematical model for sequential decision-making
with incomplete feedback. It is a recent paradigm in the reinforcement learning
community. Similarly, the multi-armed bandit problem is a classical mathematical
model for the exploration/exploitation dilemma inherent in reinforcement learning (see
Bubeck and Cesa-Bianchi, 2012). The K-armed dueling bandit problem (Yue and Joachims,
2009) is a variation of the multi-armed bandit problem where two arms are selected at each
round and only a relative feedback is received.

Several generic partial monitoring algorithms have been proposed for both stochastic and
adversarial settings (see Bartók et al., 2014, for details). With the exception of GLOBALEXP3
(Bartók, 2013), which tries to capture the structure of the games more finely, these algorithms
only focus on the time bound and perform inefficiently in terms of the number of actions.
As we show in Section 5, for a dueling bandit problem the number of actions is quadratic
in the number of arms K, and these algorithms, including GLOBALEXP3, provide at best a
$\tilde{O}(K\sqrt{T})$ regret guarantee, whereas a dedicated algorithm like REX3 (Gajane et al., 2015)
can provide a $\tilde{O}(\sqrt{KT})$ guarantee (see footnote 1). Studying partial monitoring algorithms from the
perspective of dueling bandits is hence an interesting and challenging problem which could
help us improve the ability of PM algorithms to capture the structure of sequential decision
problems.


1. The $\tilde{O}(\cdot)$ notation hides logarithmic factors.
In this preliminary work, we investigate how a utility-based dueling bandit problem
can be modeled as an instance of a partial monitoring game. Our main contribution is
to prove, using the PM formalism, that it is an easy PM instance according to the
hierarchy defined in Bartók et al. (2014). Furthermore, we take a brief look at the existing
partial monitoring algorithms and examine how they could be used to solve dueling bandit
problems efficiently.


1.1 Dueling bandits 


The K-armed dueling bandit problem is a variation of the classical multi-armed bandit
problem introduced by Yue and Joachims (2009) to formalize the exploration/exploitation
dilemma in learning from preference feedback. In its utility-based formulation, at each time
period, the environment sets a bounded value for each of the K arms and simultaneously
the learner selects two arms. The learner only sees the outcome of the duel between the
selected arms (i.e. the feedback indicates which of the selected arms has the better value) and
receives the average of the gains of the selected arms. The goal of the learner is to maximize
her cumulative gain.

Relative feedback is naturally suited to many practical applications because users are
more willing to provide a relative preference than an absolute rating. For example,
compared to “I rate Tennis at 32/50 and Football at 48/50” (absolute feedback), it is easier
for users to say “I like Football more than Tennis” (relative feedback). Information retrieval
systems with implicit feedback are another important application of dueling bandits (see
Radlinski and Joachims, 2007). The major difficulty of the dueling bandit problem is that
the learner cannot directly observe the loss (or gain) of the selected actions. To capture this
aspect of the problem, it can be modeled as an instance of the partial monitoring problem
as defined by Piccolboni and Schindelhauer (2001).


1.2 Partial monitoring games 

A partial monitoring game is defined by a tuple $(\mathbf{N}, \mathbf{M}, \Sigma, \mathcal{L}, \mathcal{H})$ where $\mathbf{N}$, $\mathbf{M}$, and $\Sigma$
are the action set, the outcome set, and the feedback alphabet respectively (see footnote 2). To each action
$I \in \mathbf{N}$ and outcome $J \in \mathbf{M}$, the loss function $\mathcal{L}$ associates a real-valued loss $\mathcal{L}(I, J)$ and the
feedback function $\mathcal{H}$ associates a feedback symbol $\mathcal{H}(I, J) \in \Sigma$.

In every round, the opponent and the learner simultaneously choose an outcome $J_t$ from
$\mathbf{M}$ and an action $I_t$ from $\mathbf{N}$, respectively. The learner then suffers the loss $\mathcal{L}(I_t, J_t)$ and
receives the feedback $\mathcal{H}(I_t, J_t)$. Only the feedback is revealed to the learner; the outcome
and the loss remain hidden. In some problems, a gain $\mathcal{G}$ is considered instead of a loss. The loss
function $\mathcal{L}$ and the feedback function $\mathcal{H}$ are known to the learner. When both $\mathbf{N}$ and $\mathbf{M}$
are finite, the loss function and the feedback function can be encoded by matrices, namely the
loss matrix and the feedback matrix, each of size $|\mathbf{N}| \times |\mathbf{M}|$. The aim of the learner is to control
the expected cumulative regret against the best single-action (or pure) strategy at time T:

$$R_T = \max_{i \in \mathbf{N}} \sum_{t=1}^{T} \left( \mathcal{L}(I_t, J_t) - \mathcal{L}(i, J_t) \right)$$
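As an illustration of the regret definition, the following minimal sketch computes the regret of a played action sequence against the best fixed action in hindsight. The function name `cumulative_regret` and the 2-action, 2-outcome loss matrix are assumptions made for this example only; they do not come from the cited works.

```python
from typing import Sequence


def cumulative_regret(loss: Sequence[Sequence[float]],
                      actions: Sequence[int],
                      outcomes: Sequence[int]) -> float:
    """Loss of the played actions minus the loss of the best fixed action."""
    # Total loss actually suffered by the learner.
    learner_loss = sum(loss[i][j] for i, j in zip(actions, outcomes))
    # Total loss of the best single action over the same outcome sequence.
    best_fixed = min(sum(row[j] for j in outcomes) for row in loss)
    return learner_loss - best_fixed


# Hypothetical loss matrix: rows are actions, columns are outcomes.
LOSS = [[0.0, 1.0],
        [1.0, 0.0]]

regret = cumulative_regret(LOSS, actions=[0, 0, 1], outcomes=[1, 1, 1])
```

Here the learner suffers a total loss of 2 while always playing the second action would have cost nothing, so the regret is 2. Note that in an actual partial monitoring game the learner cannot evaluate this quantity online, since outcomes and losses stay hidden.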

2. Uppercase boldface letters are used to denote sets 
Various interesting problems can be modeled as partial monitoring games, such as learning
with expert advice (Littlestone and Warmuth (1994)), the multi-armed bandit problem
(Auer et al. (2002)), dynamic pricing (Kleinberg and Leighton (2003)), the dark pool problem
(Agarwal et al. (2010)), label efficient prediction (Cesa-Bianchi et al. (2005)), and linear
and convex optimization with full or bandit feedback (Zinkevich (2003), Abernethy et al.
(2008), Flaxman et al. (2004)). We shall briefly explain a couple of examples:


The dynamic pricing problem: A seller has a product to sell and the customers wish
to buy it. At each time period, the customer secretly decides on a maximum amount she is
willing to pay and the seller sets a selling price. If the selling price does not exceed the maximum
amount the customer is willing to pay, she buys the product and the seller’s gain is the selling
price. If the selling price is too high, the seller’s gain is zero. The feedback is
partial because the seller only receives binary information stating whether the customer
has bought the product or not. A PM formulation of this problem is provided below:


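$$x \in \mathbf{N} \subset \mathbb{R}, \quad y \in \mathbf{M} \subset \mathbb{R}, \quad \Sigma = \{\text{“sold”}, \text{“not sold”}\}$$

$$\mathcal{G}(x, y) = \begin{cases} 0, & \text{if } x > y, \\ x, & \text{if } x \le y, \end{cases}
\qquad
\mathcal{H}(x, y) = \begin{cases} \text{“not sold”}, & \text{if } x > y, \\ \text{“sold”}, & \text{if } x \le y. \end{cases}$$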
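The gain and feedback functions of the dynamic pricing game can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the sketch assumes that a sale happens whenever the price is at most the customer's maximum amount.

```python
def pricing_gain(price: float, max_amount: float) -> float:
    """Seller's gain: the price if the customer buys, zero otherwise."""
    return price if price <= max_amount else 0.0


def pricing_feedback(price: float, max_amount: float) -> str:
    """Feedback symbol seen by the seller; max_amount itself stays hidden."""
    return "sold" if price <= max_amount else "not sold"


# The seller sees the same symbol for very different hidden amounts,
# which is what makes the feedback partial.
same_symbol = pricing_feedback(3.0, 3.5) == pricing_feedback(3.0, 100.0)
```

The last line shows why the seller cannot recover the customer's maximum amount from one round: any amount at or above the price yields the same symbol.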


The multi-armed bandit problem: At each time period, the learner pulls one of the
K arms and receives its corresponding gain, which is bounded in $[0, 1]$. The learner sees
only her gain and not the gain of the other arms. The learner’s goal is to win almost as much
as the optimal arm. A partial monitoring formulation of this problem is provided with a
set of K arms/actions $i \in \mathbf{N} = \{1, \dots, K\}$, an alphabet $\Sigma = [0, 1]$, and a set of environment
outcomes which are vectors (see footnote 3) $\mathbf{m} \in \mathbf{M} = [0, 1]^K$. The entry with index $i$ ($m_i$) denotes the
instantaneous gain of arm $i$. Assuming binary gains, $\mathbf{M}$ is finite and of size $2^K$.


$$\mathcal{G}(i, \mathbf{m}) = m_i, \qquad \mathcal{H}(i, \mathbf{m}) = m_i$$


2. Dueling bandits as a Partial Monitoring game 

The utility-based dueling bandits model is similar to multi-armed bandits but the action sets
differ. An action consists here of selecting a pair $(i, j)$ of arms. However, symmetric actions
like $(i, j)$ and $(j, i)$ lead to the same gains and provide equally informative feedback. Hence
the action set for the learner can be restricted to $\mathbf{N} = \{(i, j) : 1 \le i \le j \le K\}$. When the
environment selects an outcome $\mathbf{m} \in \mathbf{M}$ and the learner selects a duel/action $(i, j) \in \mathbf{N}$,
the instantaneous gain and feedback are as follows:




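$$\mathcal{G}((i, j), \mathbf{m}) = \frac{m_i + m_j}{2}$$

$$\mathcal{H}((i, j), \mathbf{m}) = \begin{cases}
\square & \text{if } m_i < m_j \text{ (loss)} \\
\circ & \text{if } m_i = m_j \text{ (tie)} \\
\blacksquare & \text{if } m_i > m_j \text{ (win)}
\end{cases}$$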


To illustrate this formalism, we encode a 4-armed binary-gain dueling bandit problem as
a PM problem in Figure 1. The first element of every column is of the form $m_1 m_2 m_3 m_4$
where $m_i$ is the gain of arm $i$. The first element of every row is of the form $d_1 d_2$ where
$d_1$ is the first arm being picked and $d_2$ the second.


3. Lowercase boldface letters are used to denote vectors 
Figure 1: Gain matrix $\mathcal{G}$ and feedback matrix $\mathcal{H}$ for a 4-armed binary dueling bandit problem,
resulting in 10 non-duplicate actions and 16 possible outcomes.
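The gain and feedback matrices of the binary utility-based dueling bandit game can be enumerated programmatically. The sketch below is a minimal illustration under assumptions of our own: arms are indexed from 0 rather than 1, the feedback symbols are spelled out as the strings "loss", "tie" and "win", and the helper name `dueling_game` is hypothetical.

```python
from itertools import combinations_with_replacement, product

LOSS, TIE, WIN = "loss", "tie", "win"


def dueling_game(num_arms: int):
    """Enumerate actions, outcomes, gain matrix and feedback matrix."""
    # Non-duplicate duels (i, j) with i <= j, self-duels included.
    actions = list(combinations_with_replacement(range(num_arms), 2))
    # Binary gain vectors m = (m_0, ..., m_{K-1}).
    outcomes = list(product((0, 1), repeat=num_arms))
    # Gain of a duel: average of the gains of the two selected arms.
    gain = [[(m[i] + m[j]) / 2 for m in outcomes] for i, j in actions]
    # Feedback of a duel: relative order of the two selected arms only.
    feedback = [[WIN if m[i] > m[j] else LOSS if m[i] < m[j] else TIE
                 for m in outcomes] for i, j in actions]
    return actions, outcomes, gain, feedback


actions, outcomes, gain, feedback = dueling_game(4)
```

For four arms this yields 10 rows and 16 columns, which matches the sizes stated in the caption of Figure 1.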
Figure 2: Signal matrix for action (1,2) for the same problem as in Figure 1.

3. Hierarchy and basic concepts of partial monitoring problems


In this section, we first briefly review the basic concepts of partial monitoring
problems. Most of the definitions in this section are taken from Bartók et al. (2011) and
Bartók (2013).


Consider a finite partial monitoring game with action set $\mathbf{N}$, outcome set $\mathbf{M}$, loss
matrix $\mathcal{L}$ and feedback matrix $\mathcal{H}$. For any action $i \in \mathbf{N}$, the loss vector $l_i$ denotes the column
vector consisting of the $i$-th row of $\mathcal{L}$. Correspondingly, the gain vector $g_i$ denotes the column vector
consisting of the $i$-th row of $\mathcal{G}$. For the rest of the article, the gain vector $g_i$ and the loss vector $l_i$ will
be used interchangeably depending upon the setting. Let $\Delta_{|\mathbf{M}|}$ be the $(|\mathbf{M}| - 1)$-dimensional
probability simplex, i.e. $\Delta_{|\mathbf{M}|} = \{ q \in [0, 1]^{|\mathbf{M}|} \mid \|q\|_1 = 1 \}$. For any outcome sequence of
length $T$, the vector $q$ denoting the relative frequencies with which each outcome occurs is
in $\Delta_{|\mathbf{M}|}$. The cumulative loss of action $i$ for this outcome sequence can hence be described
as follows:

$$\sum_{t=1}^{T} \mathcal{L}(i, J_t) = T \, l_i^\top q$$


The vectors denoting the outcome frequencies can be thought of as the opponent strategies.
These opponent strategies determine which action is optimal, i.e. the action with the lowest
cumulative loss. This induces a cell decomposition on $\Delta_{|\mathbf{M}|}$.

Definition 1 (Cells) The cell of an action $i$ is defined as

$$C_i = \left\{ q \in \Delta_{|\mathbf{M}|} \;\middle|\; l_i^\top q = \min_{j \in \mathbf{N}} l_j^\top q \right\}$$

In other words, the cell of an action consists of those opponent strategies in the probability
simplex for which it is the optimal action. An action $i$ is said to be Pareto-optimal if
there exists an opponent strategy $q$ such that the action $i$ is optimal under $q$. The actions
whose cells have a positive $(|\mathbf{M}| - 1)$-dimensional volume are called strongly Pareto-optimal.
Actions that are Pareto-optimal but not strongly Pareto-optimal are called degenerate.
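To make cells concrete, the sketch below lists the actions that are optimal under a given opponent strategy $q$, i.e. the actions whose cell contains $q$. The function name `optimal_actions` and the 3-action, 2-outcome loss matrix are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def optimal_actions(loss: Sequence[Sequence[float]],
                    q: Sequence[float],
                    tol: float = 1e-12) -> List[int]:
    """Indices of the actions whose expected loss under q is minimal."""
    expected = [sum(l * p for l, p in zip(row, q)) for row in loss]
    best = min(expected)
    return [i for i, e in enumerate(expected) if e - best <= tol]


# Hypothetical game: the third action always suffers loss 0.5.
LOSS = [[0.0, 1.0],
        [1.0, 0.0],
        [0.5, 0.5]]

at_midpoint = optimal_actions(LOSS, (0.5, 0.5))
off_midpoint = optimal_actions(LOSS, (0.9, 0.1))
```

In this toy game the third action is optimal only at the single point $q = (0.5, 0.5)$, so its cell has zero volume: it is Pareto-optimal but degenerate, while the first two actions are strongly Pareto-optimal.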

Definition 2 (Cell decomposition) The cells of strongly Pareto-optimal actions form a
finite cover of $\Delta_{|\mathbf{M}|}$ called the cell decomposition.

Two cells $C_i$ and $C_j$ from the cell decomposition are neighbors if their intersection is
an $(|\mathbf{M}| - 2)$-dimensional polytope. The actions corresponding to these cells are also called
neighbors. The raw feedback matrices can be ‘standardized’ by encoding their symbols
in signal matrices:

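Definition 3 (Signal matrices) For an action $i$, let $\sigma_1, \dots, \sigma_{s_i} \in \Sigma$ be the symbols occurring
in row $i$ of $\mathcal{H}$. The signal matrix $S_i$ of action $i$ is defined as the incidence matrix of
symbols and outcomes, i.e. $S_i(k, m) = [\![ \mathcal{H}(i, m) = \sigma_k ]\!]$ for $k = 1, \dots, s_i$ and $m \in \mathbf{M}$ (see footnote 4 for the bracket notation).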
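A signal matrix can be built directly from one row of the feedback matrix. The following sketch is an illustration only: the helper name `signal_matrix` is hypothetical, symbols are ordered alphabetically for determinism (the definition leaves the order free), and the example row is the feedback of a duel between two binary arms over the outcomes 00, 01, 10, 11.

```python
from typing import List, Sequence, Tuple


def signal_matrix(feedback_row: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Incidence matrix of symbols (rows) and outcomes (columns)."""
    symbols = sorted(set(feedback_row))
    matrix = [[1 if f == s else 0 for f in feedback_row] for s in symbols]
    return symbols, matrix


# Feedback of the duel between arm 1 and arm 2 for outcomes 00, 01, 10, 11.
row = ["tie", "loss", "win", "tie"]
symbols, S = signal_matrix(row)
```

Every column of the resulting matrix contains exactly one 1, since each outcome produces exactly one symbol.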

Observability is a key notion to assess the difficulty of a PM problem in terms of the regret $R_T$
against the best action at time T.

4. We use $[\![\cdot]\!]$ to denote the indicator function.
Definition 4 (Observability) For actions $i$ and $j$, we say that $l_i - l_j$ is globally observable
if $l_i - l_j \in \operatorname{Im} S^\top$, where the global signal matrix $S$ is obtained by stacking all signal matrices.
Furthermore, if $i$ and $j$ are neighboring actions, then $l_i - l_j$ is called locally observable if
$l_i - l_j \in \operatorname{Im} S_{i,j}^\top$, where the local signal matrix $S_{i,j}$ is obtained by stacking the signal matrices
of all neighboring actions for $i, j$: $S_k$ for $k \in \{ k \in \mathbf{N} \mid C_i \cap C_j \subseteq C_k \}$.
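Observability is a row-space membership test, so it can be checked by comparing matrix ranks: a vector lies in the row space of a stacked signal matrix exactly when appending it as an extra row does not increase the rank. The sketch below uses exact rational arithmetic from the standard library; the helper names `rank` and `is_observable` are assumptions of this example, and which signal matrices to stack (all of them, or only those of the neighborhood) is left to the caller.

```python
from fractions import Fraction
from typing import List, Sequence


def rank(rows: Sequence[Sequence[float]]) -> int:
    """Rank of a matrix via exact Gaussian elimination over the rationals."""
    mat = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(mat[0]) if mat else 0
    rk = 0
    for col in range(n_cols):
        # Find a pivot row at or below position rk.
        pivot = next((r for r in range(rk, len(mat)) if mat[r][col] != 0), None)
        if pivot is None:
            continue
        mat[rk], mat[pivot] = mat[pivot], mat[rk]
        # Eliminate this column from all other rows.
        for r in range(len(mat)):
            if r != rk and mat[r][col] != 0:
                factor = mat[r][col] / mat[rk][col]
                mat[r] = [a - factor * b for a, b in zip(mat[r], mat[rk])]
        rk += 1
    return rk


def is_observable(diff: Sequence[float], stacked: List[List[int]]) -> bool:
    """True if diff lies in the row space of the stacked signal matrix."""
    return rank(list(stacked) + [list(diff)]) == rank(stacked)


# Signal matrix of a duel between two binary arms (outcomes 00, 01, 10, 11).
S = [[0, 1, 0, 0],   # loss
     [1, 0, 0, 1],   # tie
     [0, 0, 1, 0]]   # win
observable = is_observable([0, -1, 1, 0], S)      # m_1 - m_2 = win - loss
not_observable = is_observable([1, 0, 0, 0], S)   # would separate 00 from 11
```

The second vector is not observable because the duel returns the same symbol for the outcomes 00 and 11, so no linear combination of signal rows can tell them apart.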

Theorem 1 (Classification of partial monitoring problems) Let $(\mathbf{N}, \mathbf{M}, \Sigma, \mathcal{L}, \mathcal{H})$ be
a partial monitoring game. Let $\{C_1, \dots, C_k\}$ be its cell decomposition, with corresponding
loss vectors $l_1, \dots, l_k$. The game falls into one of the following four regret categories.

• $R_T = 0$ if there exists an action with $C_i = \Delta_{|\mathbf{M}|}$. This case is called trivial.

• $R_T \in \Theta(T)$ if there exist two strongly Pareto-optimal actions $i$ and $j$ such that $l_i - l_j$
is not globally observable. This case is called hopeless.

• $R_T \in \tilde{\Theta}(\sqrt{T})$ if it is not trivial and for all pairs of (strongly Pareto-optimal) neighboring
actions $i$ and $j$, $l_i - l_j$ is locally observable. This case is called easy.

• $R_T \in \Theta(T^{2/3})$ if the game is not hopeless and there exists a pair of neighboring actions $i$ and
$j$ such that $l_i - l_j$ is not locally observable. This case is called hard.


4. Dueling bandits in the partial monitoring hierarchy 


This section examines the place of the dueling bandit problem in the hierarchy of partial
monitoring problems described above. Note that the existence of the REX3 algorithm
(Gajane et al., 2015) with a $\tilde{O}(\sqrt{KT})$ regret guarantee is enough to state that the dueling
bandit problem is an easy game according to the hierarchy described in Theorem 1, but our aim here
is to retrieve this result from the PM machinery.


Theorem 2 (Dueling bandits: locally observable) In a binary utility-based dueling
bandit problem with more than two arms, all the pairs of actions are locally observable.

Proof Consider a dueling bandit problem as defined in Section 2 with binary gains and
$K > 2$ arms. The signal matrix of any action $(i, j) \in \mathbf{N}$ is defined as follows:

$$S_{(i,j)}(\square, \mathbf{m}) = [\![ m_i < m_j ]\!], \quad S_{(i,j)}(\circ, \mathbf{m}) = [\![ m_i = m_j ]\!], \quad S_{(i,j)}(\blacksquare, \mathbf{m}) = [\![ m_i > m_j ]\!]$$

In the following, we show that for any pair of actions $(i, j)$ and $(i', j')$, $g_{(i,j)} - g_{(i',j')}$ is
locally observable. For the sake of readability, let $S^{\blacksquare}$, $S^{\circ}$ and $S^{\square}$ be the column
vectors containing the rows pertaining to the symbols $\blacksquare$, $\circ$ and $\square$ of the signal matrix $S$
respectively. We consider the following two cases for the pair of actions, which together
cover all the possibilities:
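• A pair of actions that share at least one common arm:

1. Actions $(i, k)$ and $(k, j)$. For any binary gain outcome $\mathbf{m}$, we have:

$$g_{(i,k)}(\mathbf{m}) - g_{(k,j)}(\mathbf{m}) = \frac{m_i + m_k}{2} - \frac{m_k + m_j}{2}$$
$$= 0.5 \left( [\![ m_i > m_j ]\!] - [\![ m_j > m_i ]\!] \right)$$
$$= 0.5 \left( S^{\blacksquare}_{(i,j)}(\mathbf{m}) - S^{\square}_{(i,j)}(\mathbf{m}) \right) \qquad (1)$$

So $g_{(i,k)} - g_{(k,j)}$ lies in the row space of the signal matrix of the action $(i, j)$
and hence in the row space of the signal matrix of the neighborhood action set
(refer to Definition 3).

2. Actions $(i, k)$ and $(j, k)$. Similarly, $g_{(i,k)} - g_{(j,k)} = 0.5 \left( S^{\blacksquare}_{(i,j)} - S^{\square}_{(i,j)} \right)$.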

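• No common arm ($i \ne i'$, $j \ne j'$): In this case,

$$g_{(i,j)} - g_{(i',j')} = g_{(i,j)} - g_{(i,j')} + g_{(i,j')} - g_{(i',j')}$$
$$= 0.5 \left( S^{\blacksquare}_{(j,j')} - S^{\square}_{(j,j')} \right) + 0.5 \left( S^{\blacksquare}_{(i,i')} - S^{\square}_{(i,i')} \right) \quad \text{(from equation (1))}$$

Hence, for any pair of actions $(i, j)$ and $(i', j')$, $g_{(i,j)} - g_{(i',j')}$ falls in the row space of
the signal matrix of the neighborhood action set, i.e. $g_{(i,j)} - g_{(i',j')} \in \operatorname{Im} S^\top_{(i,j),(i',j')}$,
therefore it is locally observable. So, by extension, the binary dueling bandit problem is
locally observable and hence we arrive at the following corollary. ■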


Corollary 3 According to the hierarchy described in Theorem 1, the binary dueling bandit
problem is easy and its regret is $\tilde{O}(\sqrt{T})$.
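The identity at the heart of the proof, namely that a gain difference between two duels sharing an arm equals half the difference between the "win" and "loss" signal rows of the duel between the two remaining arms, can be checked exhaustively for small K. The sketch below is a sanity check under our own conventions (0-indexed arms, hypothetical helper names), not a substitute for the proof.

```python
from itertools import product
from typing import Sequence


def gain(i: int, j: int, m: Sequence[int]) -> float:
    """Gain of the duel (i, j) under the binary outcome m."""
    return (m[i] + m[j]) / 2


def win(i: int, j: int, m: Sequence[int]) -> int:
    """Entry of the 'win' signal row of duel (i, j) for outcome m."""
    return 1 if m[i] > m[j] else 0


def loss(i: int, j: int, m: Sequence[int]) -> int:
    """Entry of the 'loss' signal row of duel (i, j) for outcome m."""
    return 1 if m[i] < m[j] else 0


def identity_holds(num_arms: int) -> bool:
    """Check g_(i,k)(m) - g_(k,j)(m) = 0.5 * (win - loss) of duel (i, j)."""
    for m in product((0, 1), repeat=num_arms):
        for i, j, k in product(range(num_arms), repeat=3):
            lhs = gain(i, k, m) - gain(k, j, m)
            rhs = 0.5 * (win(i, j, m) - loss(i, j, m))
            if lhs != rhs:
                return False
    return True


verified = identity_holds(4)
```

The check relies on the gains being binary: for real-valued gains, $m_i - m_j$ is no longer determined by the sign of the comparison, which is why the theorem is stated for the binary case.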


5. Partial monitoring algorithms and their use for dueling bandits 


FEEDEXP3 by Piccolboni and Schindelhauer (2001) was the first algorithm for finite partial
monitoring games. Its application has an important pre-condition: the existence
of a matrix $B$ such that $B\mathcal{H} = \mathcal{G}$. We prove by contradiction that such a matrix $B$ does not
exist for the dueling bandit problem. Assume that $B$ exists. Then, for any action
$(i, j) \in \mathbf{N}$ and any outcome vector $\mathbf{m} \in \mathbf{M}$,

$$\mathcal{G}((i, j), \mathbf{m}) = \sum_{1 \le k \le l \le K} B_{(i,j),(k,l)} \, \mathcal{H}((k, l), \mathbf{m})$$
Consider $\mathbf{m} = 0 \dots 0$, i.e. the gain of every arm is 0. In this case, the gain of any action
$(i, j)$ is 0 and the feedback for every action is $\circ$, therefore

$$0 = \sum_{1 \le k \le l \le K} B_{(i,j),(k,l)} \cdot \circ \qquad (2)$$


Now consider $\mathbf{m} = 1 \dots 1$, i.e. the gain of every arm is 1. In this case, the gain of any action
$(i, j)$ is 1 and the feedback of every action is $\circ$, therefore

$$1 = \sum_{1 \le k \le l \le K} B_{(i,j),(k,l)} \cdot \circ \qquad (3)$$




Eq. (2) and Eq. (3) contradict each other, since their right-hand sides are identical; therefore
our assumption that $B$ exists is incorrect. Fortunately, the authors also provide a general algorithm which performs several
matrix transformations to sidestep this pre-condition. These transformations are studied
thoroughly in (Bartók, 2012).

BALATON by Bartók et al. (2011), and CBP-VANILLA and CBP by Bartók (2012), belong
to the family of algorithms for locally observable PM games, as does GLOBAL-EXP3
by Bartók (2013). For GLOBAL-EXP3, the regret bound of $\tilde{O}(\sqrt{N'T})$ does
not directly depend on the number of actions but rather on the structure of the game, as $N'$
is the size of the largest point-local game. We can however provide a counter-example for
utility-based dueling bandits where $N'$ is of order $K^2$, in the following way.

We use the notations from Bartók (2013). Consider a point $p$ in the probability simplex $\Delta_{|\mathbf{M}|}$
where all the arms have maximal gains. For this $p$, all the actions are optimal, therefore
this point belongs to all the cells in the cell decomposition. Hence, according to Definition
6 in Bartók (2013), there exists a point-local game consisting of all the $K(K + 1)/2$ non-duplicate
actions. Therefore the upper bound of GLOBAL-EXP3 translates to $\tilde{O}(K\sqrt{T})$
for utility-based dueling bandits.

The following table summarizes the salient features of these PM algorithms. We can
clearly see that none of them, except REX3, is optimal with respect to the number of
actions $N$. Please note that for the dueling bandit problem, $N \approx K^2$.


Table 1: Summary of PM algorithms

| Algorithm | Setting | Optimality | Regret |
|---|---|---|---|
| FEEDEXP3 (Piccolboni and Schindelhauer (2001)) | Adversarial | Not in T or N | $\tilde{O}(T^{3/4} K)$ |
| BALATON (Bartók et al. (2011)) | Stochastic | Not in T or N | $\tilde{O}(K\sqrt{T})$ |
| CBP (Bartók (2012)) | Stochastic | In T, not in N | $O(K^2 \log T)$ |
| GLOBAL-EXP3 (Bartók (2013)) | Adversarial | In T, not in N | $\tilde{O}(K\sqrt{T})$ |
| SAVAGE (Urvoy et al. (2013)) | Stochastic | In T, not in N | $O(K^2 \log T)$ |
| Neighborhood Watch (Foster and Rakhlin (2011)) | Adversarial | In T, not in N | $\tilde{O}(K\sqrt{T})$ |
| REX3 (Gajane et al. (2015)) | Adversarial | In T and N | $\tilde{O}(\sqrt{KT})$ |


6. Conclusion 

In this article, we studied the dueling bandit problem as an instance of the partial monitoring
problem. We proved that the binary dueling bandit problem is a locally observable game
and hence falls in the easy category of partial monitoring games. We also looked at
some of the existing partial monitoring algorithms and their optimality with respect to
both time and the number of actions.

Notation

Table 2: Notation table

| Symbol | Description |
|---|---|
| $K$ | Number of arms |
| $t$ | Time index |
| $T$ | Time horizon |
| $R_T$ | Cumulative regret after time $T$ |
| $\mathbb{E}_\pi(\dots)$ | Expectation according to $\pi$ |
| $\mathbf{N}$ | Set of actions |
| $\mathbf{M}$ | Set of outcomes |
| $\mathbf{m}$ | Outcome vector $\in \mathbf{M}$ |
| $\mathcal{L}$ | Loss function/matrix |
| $\mathcal{G}$ | Gain function/matrix |
| $\mathcal{H}$ | Feedback function/matrix |
| $l_i$ | Loss vector: column vector consisting of the $i$-th row of $\mathcal{L}$ |
| $g_i$ | Gain vector: column vector consisting of the $i$-th row of $\mathcal{G}$ |
| $C_i$ | Cell of action $i$ |
| $\lvert \cdot \rvert$ | Size of a set |
| $\mathbb{R}$ | Set of real numbers |